Across-environment seed protein stability and genetic architecture of seed components in soybean

The recent surge in the plant-based protein market has resulted in high demands for soybean genotypes with improved grain yield, seed protein and oil content, and essential amino acids (EAAs). Given the quantitative nature of these traits, complex interactions among seed components, as well as between seed components and environmental factors and management practices, add complexity to the development of desired genotypes. In this study, the across-environment seed protein stability of 449 genetically diverse plant introductions was assessed, revealing that genotypes may display varying sensitivities to such environmental stimuli. The EAAs valine, phenylalanine, and threonine showed the highest variable importance toward the variation in stability, while both seed protein and oil contents were among the explanatory variables with the lowest importance. In addition, 56 single nucleotide polymorphism (SNP) markers were significantly associated with various seed components. Despite the strong phenotypic Pearson’s correlation observed among most seed components, many independent genomic regions associated with one or few seed components were identified. These findings provide insights for improving the seed concentration of specific EAAs and reducing the negative correlation between seed protein and oil contents.

sulfur-containing amino acids cysteine and methionine are relatively lower than desired levels for the optimum growth of monogastric animals such as poultry and swine by 2.0 and 3.0%, respectively 12 .Soybean meal protein is also deficient in the EAAs threonine and lysine as the sole dietary source of protein for animal feed, and a 10% increase in the concentration of these amino acids would improve the commercial value of soybean meal 13 .
Given that seed protein and oil content and amino acids (EAAs, SEAAs, and NEAAs) composition are key factors in determining soybean seed quality for feed and food uses, it is important to optimize the seed composition and ratio of amino acids in soybean cultivars [14][15][16][17] .Currently, seed protein content in soybean cultivars ranges between 330 and 390 g kg −1 at 130 g kg −1 moisture content 18 .Historical breeding efforts for soybean cultivars with higher seed oil content and yield potential have negatively impacted seed protein content 19,20 .Therefore, the development and identification of cultivars and germplasm with high seed protein content and appropriate ratio of amino acid concentration may play an important role to supply plant-based protein for human and animal nutrition.
Several genomic regions associated with seed protein as well as amino acid composition in soybean have been reported.According to Soybase.org, there are currently over 250 reported quantitative trait loci (QTLs) associated with soybean seed protein content, and a much lower number of QTLs and genomic regions associated with various amino acids (www.soyba se.org).The first report of QTLs associated with soybean seed protein dates back more than 30 years when Diers et al. (1992) reported significant associations on chromosomes 20 (linkage group (LG) I) and 15 (LG E) 21 .Subsequent studies have reported and confirmed the genomic regions identified on chromosomes 20 and 15 using populations with various genetic backgrounds, tested in diverse environments, and genotyped with distinct molecular markers density [22][23][24][25][26][27][28] .
Although genetics account for a larger portion of the phenotypic variation in soybean seed protein content compared to the environment, genotype by environment interaction (G×E) significantly affects seed protein and amino acid composition in soybean 29,30 .Various environmental covariates affect soybean seed protein content and composition, including temperature [31][32][33][34][35] , water stress 36,37 , and nitrogen availability [36][37][38] .However, the rationale behind genotypes showing variable across-environment seed protein stability under different environmental conditions is still unknown.To date, limited studies have investigated the effect of various seed components (seed protein and oil contents, and EAAs concentration) on seed protein stability across diverse environments.Therefore, the goals of this study were to i) characterize the genetic architecture of seed protein and oil contents, as well as EAAs concentration in a panel of genetically diverse soybean accessions and ii) identify seed components that are contributing to the across-environment seed protein stability.

Plant materials and data collection
A total of 449 genetically diverse plant introductions (PIs) ranging from maturity group (MG) 4 to 6 were used in this study.The distribution of PIs based on maturity consisted of 126 MG 4 PIs, 181 MG 5 PIs, and 142 MG 6 PIs.Accessions were obtained from the USDA Soybean Germplasm Collection and originated from 16 countries, including South Korea (173), Japan (116), United States (67), China (59), North Korea (11), Argentina (3), Mexico (3), one accession from India, Morocco, Pakistan, South Africa, Turkey, Uganda, Vietnam, and Zimbabwe, respectively, and nine accessions with an unknown origin.All entries were genotyped with the SoySNP50K BeadChip 39 and data has been made available by authors at the SoyBase Genetics and Genomics Database (http:// soyba se.org/ snps/ downl oad.php) 40 .Non-polymorphic single nucleotide polymorphisms (SNPs) were removed, resulting in 41,080 SNPs being used in the genome-wide association studies (GWAS).We confirm that our field studies involving plants, whether cultivated or wild, adhered to applicable institutional, national, and international guidelines and legislation.
Field trials were conducted in seven environments (unique combination of location and year) at the Milo J. Shult Agricultural Research and Extension Center in Fayetteville, AR in 2017 and 2018 (36°05′56′′ N, 94°10′42′′ W), the Northeast Research & Extension Center in Keiser, AR in 2017 and 2018 (35°40′27''N, 90°05′15′′ W), the Rice Research & Extension Center in Stuttgart, AR in 2017 (34°28′31′′ N, 91°25′07′′ W), and the Rohwer Research Station in Rohwer, AR in 2017 and 2018 (33°48′35′′ N, 91°16′13′′ W).Trials were conducted using a randomized design.Each plot consisted of one 3-m long row with 0.75 m row space and approximately 100 seeds were sowed per plot.At physiological maturity 41 , seeds were hand-harvested.A 20-g seed sample per genotype per environment was used for near-infrared reflectance spectroscopy (NIR) analysis using an IM 9500 Grain Analyzer (Perten Instruments®, Hägersten, Sweden) to determine seed protein and oil content, and EAAs concentration.The high protein cultivar 'Osage' 42 was used as the reference check.Multi-environmental data (dry weight basis) was analyzed in R 43 .Adjusted means across environments were calculated utilizing the package 'lmerTest' 44 based on a mixed-effects linear model fitted using the package 'lme4' 45 .The model included the fixed effect of genotype and the random effect of the environment (location × year).
To measure the reliability and consistency of the collected phenotype, Cronbach's alpha (α) score 46 was calculated for each trait using the R package 'psych' 47 following Eq.(1).To understand the inter-relatedness between seed components, Pearson's correlation coefficients were calculated between all possible pairs of traits based on the adjusted means utilizing the function 'cor' in R.
where k is the number of observations per seed component, c is the average inter-item covariance of observed seed components between each pair of environments averaged for all pairs of environments, and v is the average variance of each observed seed component across all environments. (1)

Seed components effect on across-environment seed protein stability
To obtain the across-environment seed protein stability, Shukla's Stability Variance (σi 2 ) 48 was calculated using the R package 'statgenG×E' 49 .Shukla's stability refers to the variance around the genotype's phenotypic mean across all environments.It partitions the total G×E term into unbiased estimates of the G×E variance attributable to each genotype which provides a measure of stability without accounting for the performance of the genotype 50 .Shukla's Stability Variance considers a genotype stable if its response to multiple environments is parallel to the mean response of all genotypes in the test 51 .
To determine the effect of seed components on seed protein stability (metric derived from Shukla's Stability Variance), the variable importance was calculated based on the "Gini Importance" obtained from a Random Forest model (RF) 52 .RF is a supervised learning algorithm based on the assembly of multiple decision trees which conducts feature selection and generates non-correlated decision trees 52 .In summary, the Gini Importance indicates how often a variable is selected for a given split across all trees and the magnitude of its discriminative value (decrease in node impurity) for the classification problem 52,53 .The 'Gini Importance' for each seed component was retrieved using the function 'importance' of the R package 'caret' 54 .The model was fitted using Shukla's Stability Variance as the variable response and the EAAs concentration as explanatory variables utilizing the function 'train' of the R package 'caret' with the method set to 'rf' .The dataset was divided into training and testing sets consisting of 70 and 30% of the observations, respectively.The model was tuned by going through a grid search for the hyperparameter 'mtry' (1-5, number of variables randomly collected to be sampled at each split time), with 'ntree' set at 1000 (number of trees per aggregation) with ten-fold cross-validation.Both response and explanatory variables were standardized into z-scores (observed value subtracted by the sample average, and divided by the sample standard deviation) utilizing the function 'scale' in R.

Genetic architecture of seed components
To unveil the genetic architecture of seed protein and oil contents, as well as EAAs concentration, genomewide association studies (GWAS) were conducted to detect significant marker-trait associations between SNPs and each seed component.The Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK) 55 model was applied to conduct GWAS using the R package "GAPIT" 56 .The adjusted means across environments for each trait were used as the variable response in each model.In essence, BLINK represents an enhanced version of the Fixed and Random Model Circulating Probability Unification (FarmCPU) 57 .FarmCPU leverages the benefits of both mixed linear models and stepwise regression through their iterative use.It replaces kinship with a set of molecular markers, treated as fixed effects, which are individually tested at each iteration across the genome.These markers are optimized using a restricted maximum likelihood approach within a mixed linear model framework.This approach utilizes the variance and covariance established by the pre-selected markers, limiting the risk of model overfitting 57 .BLINK discards the assumption of even gene distribution related to a trait throughout the genome.It replaces the restricted maximum likelihood with Bayesian Information Content (BIC), leading to enhanced computational speed and efficiency 55 .

Phenotype consistency and reliability across environments
Across all environments, both seed protein and oil contents have shown high Cronbach's alpha (α) scores with confidence intervals ranging from 0.89 to 0.92 and 0.90 to 0.92, respectively, and minimum estimated error variance (0.17 and 0.15, respectively) (Table 1).Across genotypes, seed protein content averaged 440 g kg −1 with a standard deviation (SD) of 15 g kg −1 ; seed oil content averaged 190 g kg −1 with an SD of 10 g kg −1 .Overall, EAAs have shown high reliability and consistency across environments with α scores ranging from 0.82 to 0.92 (Table 1).Tryptophan was the EAA with lowest α scores (confidence interval ranging from 0.55 to 0.66), although the obtained values are still within the acceptable consistent range [58][59][60][61]  as the correlation of the test with itself, of which the error variance is obtained by subtracting the squared α from 1.00 62 .In plant breeding, α scores can indicate the experimental quality and reliability of measured phenotypes.Like heritability, it can be explored as a measurement of the influence of genetic vs. nongenetic 63 .

Pearson's correlation among seed components
A normal distribution was observed for each seed-related trait confirming the availability of substantial phenotypic variance among genotypes (Fig. 1).Across genotypes, seed protein content was significantly correlated with seed oil content (− 0.697) and EAAs (0.283-0.985) (Fig. 1).As expected, there was a strong negative correlation between seed protein and oil contents, probably due to closely linked and/or pleiotropic loci regulating the traits' expression.Although seed protein content was significantly correlated with all EAAs, methionine (0.736) and tryptophan (0.283) showed a lower correlation as compared to other EAAs (Fig. 1).Seed oil content, on the other hand, showed significant negative correlations with all EAAs (− 0.673 to − 0.467), except tryptophan (− 0.011).Interestingly, the strength of correlations between seed oil content and EAAs was not proportional to the strength of correlation between seed protein and oil.This indicates that investigating the composition of EAAs may be an interesting approach for increasing overall seed protein content, as well as reducing the degree of negative correlation between seed protein and oil.In addition, although the strength of the correlation between seed protein content and EAAs was similar, the correlation among EAAs does not follow a similar strength pattern indicating that it may be possible to modify the composition of soybean meal by manipulating individual EAAs through genetic recombination.To understand the influence of seed components (seed protein and oil contents, and EAAs concentration) on the variation of seed protein stability across environments, the variable importance of each seed component was calculated based on the "Gini Importance" obtained from a Random Forest model (RF).The importance ranged from 17.68 (Lysine) to 59.24 (Valine), with an average of 28.94 (Fig. 2).Overall, the EAAs valine (59.24), phenylalanine (40.19), and threonine (33.84) were the variables with the highest importance, whereas lysine (17.69), isoleucine (19.06), seed oil content (19.85), leucine (20.27), seed protein content (20.93), and tryptophan (25.68) were the variables with the lowest importance in seed protein stability (Fig. 2).In general, seed oil content, methionine, and tryptophan were inversely proportional to across-environment seed protein stability, while seed www.nature.com/scientificreports/protein content, isoleucine, leucine, lysine, phenylalanine, threonine, and valine were directly proportional to across-environment seed protein stability (Figure S1).

Discussion
The increase in consumer awareness regarding the socio-environmental impact of food production practices led to substantial growth in the consumption of plant-based protein products 64 .Plant-based protein products provide various environmental benefits compared to animal-derived products, including reduced land, water, and energy requirements.This supports the sustainability of the food production chain, resulting in a lower overall carbon footprint 65 .Hence, the global market for plant-based meat is growing at a fast pace of 14.8% yearly and is estimated to surpass $30 billion by 2026 66 .Soybean protein is a major source of plant-based products accounting for over 70% of global plant-based protein meal for animal feed 2 .Given the importance of seed composition towards soybean seed quality and overall value, breeding programs have emphasized the development of high-yielding cultivars with improved seed composition.However, there are complex interactions among seed components, as well as between seed components and grain yield 18,28,[67][68][69][70][71][72] .Across 449 genetically diverse PIs used in this study representing most of the genetic diversity among maturity groups 4 to 6, seed protein content was found to be significantly correlated with EAAs (Person's correlation ranging from 0.283 to 0.985) and seed oil content (− 0.697).It is worth noting that the collected phenotypic data demonstrated high consistency and reliability across environments with Cronbach's α scores averaging 0.86 across all seed composition traits.The negative phenotypic correlation between seed protein and oil contents has been previously reported in the literature 67,68,[72][73][74] .Although seed protein content may be a good proxy for the EAAs seed concentration, methionine (0.736) and tryptophan (0.283) were found to have a weaker correlation with seed protein content as compared to the remaining EAAs.Similar results have been reported where methionine and tryptophan showed a reduced correlation with seed protein content 67,68,72,75 .In addition, the negative correlation between EAAs and seed oil content (− 0.011 to − 0.673) was shown to be weaker and not proportional to the strength of the correlation between seed protein and oil contents.Tryptophan did not show a significant correlation with seed oil content (− 0.011).The low correlation between tryptophan and both seed protein and oil contents may be attributed to independent biological mechanisms, although the difficulty in measuring this Table 2. Summary of significant marker-trait associations for seed protein and oil contents, as well as essential amino acids concentration. 1 Chr, Chromosome. 2 Pro, seed protein content; Oil, seed oil content. 3Iso, isoleucine; Leu, leucine; Lys, lysine; Met, methionine, Phe, phenylalanine, Thr, threonine, Try, tryptophan; Val, valine. 4Values in bold are considered statistically significant (threshold of LOD > 5.0).

Figure 3.
Manhattan plot of marker-trait associations for seed protein and oil contents.
amino acid may also contribute to the observed low correlations 67,68,76 .Hence, these results engage the hypothesis that improving the seed protein content through manipulating the composition of specific EAAs may reduce the degree of pleiotropy between seed protein and oil.
In addition, the interaction between genotype, environment, and management practices (G×E×M) substantially impacts seed protein and oil contents as well as EAAs concentration.For instance, water deficit has been reported to reduce seed protein content while increasing seed oil content 37,77,78 .In contrast, increased precipitation has been associated with reductions in both seed protein and oil contents 79 .Higher temperatures during grain filling have been associated with lower seed protein and higher oil contents 35,37,74,80 , although the relationship between these variables may be dependent on genotype and geographical latitude 35 .Nitrogen fertilization as well as planting dates have also been reported to alter seed protein and oil contents 37,74,81 .To date, however, there are limited reports on the impact of different genetic backgrounds and various seed components on the across-environment stability of seed protein content.
In this study, the across-environment stability of seed protein content was calculated using Shukla's Stability Variance (σi 2 ) 48 and ranged from 0.12 (PI 398351) to 17.16 (FC 33243), with an average of 2.70.It indicates that, although environmental parameters and management practices impact seed protein content, genotypes may express different sensitivities to these stimuli.In summary, Shukla's Stability Variance is a univariate parametric method that estimates the variance (σ i 2 ) of the genotype by environment interaction (G×E) plus an error term associated with the genotype in multiple environments.Hence, it measures stability rather than performance based on the overall contribution of a genotype to the G×E sums of squares and the stability variance σ i 2 .A genotype with low σ i 2 shows minimal contribution towards G×E and therefore is considered stable.This methodology has been used to measure yield stability in soybean 82 and various crops including fava beans (Vicia faba L.) 83 , safflower (Carthamus tinctorius L.) 84 , winter bread wheat (Triticum aestivum L.) 85 , and maize (Zea mays L.) 86 .Nevertheless, a limitation of this method is the minimum generalization of measured stability in a different set of tested genotypes.In this study, however, 449 genetically diverse PIs within maturity groups 4-6 were included, which offsets the limited generalization of results since these genotypes cover most of the available genetic diversity within the tested maturity groups.
To further investigate the across-environment seed protein stability, a random forest model was developed using Shukla's Stability Variance as the response variable and the seed components as the explanatory variables.This approach has been used across multiple disciplines to identify relevant explanatory variables, including selection of molecular markers in genomic prediction 87 , selection of bands derived from hyperspectral data for trait estimation 88 , up to selection of explanatory variables to predict probable biological stream conditions 89 .Herein, explanatory variables had different variable importance towards the observed variation in seed protein  To the best of our knowledge, this is the first report exploring potential underlying causes of G×E affecting across-environment seed protein stability.It unveils opportunities to develop genotypes with more stable acrossenvironment seed protein content by manipulating the EAAs concentration through genetics and breeding.For instance, valine belongs to the branched-chain amino acids (BCAA) and is among the most hydrophobic of amino acids.This property is important for both the stability of folded protein and the folding pathway leading to the mature structure 90,91 .In swine, valine is a great energy source as it is one of the most energy-generating amino acids through the oxidation of branched-chain α-keto acid dehydrogenase complex 92 .Phenylalanine is the precursor of tyrosine 93 .In plants, phenylalanine is essential in the interconnection between primary and secondary metabolism.It is a precursor for numerous plant compounds that are essential for plant reproduction, growth, development, and stress defense 94 .Threonine is one of the most limiting amino acids in swine and poultry diets 95,96 .It is necessary for the synthesis of threonine-rich proteins in epithelial tissues and regulates lipid metabolism, embryonic stem cell proliferation and differentiation, and intestinal health including nutrient digestibility, gut microbiota, intestinal barrier, and immune function 97,98 .Still, further studies are needed to reveal the biological rationale behind the effect of certain EAAs toward across-environment seed protein stability, as well as validate the observed results across environments in various latitudes, as well as a range of genetically diverse genotypes in different maturity groups.
Previous studies have shown the quantitative nature of seed components and the minor effect markers associated with seed protein and oil content as well as EAAs concentration distributed across multiple chromosomes of the soybean genome 27,[99][100][101] .Herein, 56 SNP markers were significantly associated with seed composition traits.Eleven of these SNPs displayed significant associations with two or more traits.Notably, the marker-trait associations found for various seed components across multiple chromosomes unveil their complex genetic architecture and highlight the necessity for dissecting trait interactions and their complex G×E.These results also provide directions for improving the concentration of specific EAAs, as well as reducing the negative correlation between seed protein and oil contents.For instance, while soybean seed contains all EAAs, the concentration of the sulfur-containing EAA methionine is suboptimal for the nutritional needs of monogastric animals 102 .A total of 13 significant marker-trait associations were identified for methionine, of which none have shown significance to seed protein content.This indicates that it is likely possible to improve the seed concentration of methionine through genetics and breeding by targeting genomic regions and genetic resources independent of seed protein.This observation is consistent with previous studies that reported a weaker association between seed protein and methionine 72,75 .In addition, 10 significant marker-trait associations were identified for seed protein while five significant marker-trait associations were identified for seed oil.Interestingly, despite some of these associations occurring in the same chromosome, no overlapping was observed among significant SNPs providing potential molecular directions to uncouple the negative strong correlation between these traits.It is also worth highlighting that, although GWAS has been proven to be a strong tool for identifying significant marker-trait associations and unveiling the genetic architecture of traits of interest, genome-wide prediction models leveraging the genetic correlations among seed components can be explored to improve the identification and selection of genotypes with various desirable traits.

Conclusions
Manipulating soybean seed components including seed protein and oil content, as well as EAAs concentration, through plant breeding and genetics can maximize the overall value of the crop and its flexibility regarding post-processing applications.However, given the quantitative genetic architecture of such components, several complex interactions observed among seed components and between traits and environmental and management practices substantially challenge the improvement of soybean grain yield and seed components simultaneously.Herein, a thorough investigation of the sensitivity of a wide range of genetically diverse genotypes to environmental stimuli was assessed.It was reported that genotypes may respond differently to such stimuli in regards to seed protein content, and that specific seed components, particularly valine, phenylalanine, and threonine may play a significant role in the overall across-environment protein stability.In addition, various genomic regions have been highlighted to be associated with the variation of multiple seed components.Interestingly, many independent regions (i.e.associated with one or few seed components) were reported.Thus, the manipulation of EAAs through plant breeding and genetics may offset the strong negative correlation between seed protein and oil contents.

Figure 1 .
Figure 1.Phenotypic distribution of seed components and Pearson's correlation among traits.

Figure 2 .
Figure 2. Gini Importance of seed components toward the variation of across-environment seed protein stability.

Table 1 .
. Cronbach's alpha scores can be interpreted Summary of Cronbach's alpha (α) scores and trait metrics across environments.
aConfidence interval (95%) of standardized α score.b Estimated error variance was obtained by subtracting the squared α from 1.00.c Inter-item average Pearson's correlation.d Average of each trait across all observed entries reported on dry weight basis.e Standard Deviation of each trait across all observed entries reported on dry weight basis.

SNP Chr 1 Position Pro 2 Oil 2 Iso 3 Leu 3 Lys 3 Met 3 Phe 3 Thr 3 Try 3 Val 3 bp Logarithm of Odds (LOD)
www.nature.com/scientificreports/stabilityvariance.The EAAs valine(59.24),phenylalanine(40.19),andthreonine(33.84)showed the highest Gini Importance across all seed components, whereas both seed protein(20.93)andoil(19.85)contents were among the explanatory variables with the lowest Gini Importance.It indicates that total seed protein and oil contents are not associated with across-environment seed protein stability and that EAAs concentration may partially explain the differences in sensitivity to environmental stimuli in seed protein content across genotypes.